Amazon Reviews - Building a Recommendation System

Comparison of Collaborative Filtering and Popularity-Based Models for the Amazon Electronics Dataset

Domain: E-commerce

Data Description:

Data columns: the first three columns are userId, productId, and rating; the fourth column is timestamp. The timestamp column may be discarded, as it is not strictly needed for this case study.

Source:

Amazon Reviews data (http://jmcauley.ucsd.edu/data/amazon/). The repository hosts several datasets; for this case study we use the Electronics dataset.

Learning Outcomes:

Objective:

To build a recommendation system that recommends at least five new products to each user based on their rating habits.

  1. Read and explore the given dataset (rename columns/add headers, plot histograms, find data characteristics). (3 marks)
  2. Take a subset of the dataset to make it denser/less sparse (for example, keep only the users who have given 50 or more ratings). (5 marks)
  3. Build a Popularity Recommender model. (15 marks)
  4. Split the data randomly into train and test datasets (for example, in a 70/30 ratio). (2 marks)
  5. Build a Collaborative Filtering model. (20 marks)
  6. Evaluate the above models. Once a model is trained on the training data, it can be used to compute the error (e.g. RMSE) on predictions made for the test data. You may also use a different evaluation method. (5 marks)
  7. Get top-K (K = 5) recommendations. Since our goal is to recommend new products to each user based on their habits, we will recommend 5 new products. (10 marks)
  8. Summarise your insights. (10 marks)

Project Roadmap:

Step 1: Import the necessary Libraries

Step 2: Load the dataset

Step 3: Exploratory Data Analysis & Feature Engineering

Step 4: Split the dataset into Train & Test set for Model Building

Step 5: Build Popularity Based Recommender Model

Step 6: Build Collaborative Filtering Based Recommender Model

- Matrix Factorization method - Singular Value Decomposition
- KNN With Means Model (User-user based)
- KNN With Means Model (Item-item based)
- Cross Validation, Hyperparameter Tuning & Model Evaluation for all the Models

Step 7: Comparison of Recommendations from different Models

- Comparison of RMSE scores
- Comparison of top N recommendations for a sample user by all three models

Step 8: Learnings and Summary

Step 1: Import the necessary Libraries

Step 2: Load the dataset

Dataframe columns are in the order userId, productId, ratings and timestamp. Let us name the columns.
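
The loading step could be sketched as below. The file name and the product IDs are placeholders (the user IDs are the two sample users used later in the notebook); a tiny embedded sample keeps the snippet runnable.

```python
import io
import pandas as pd

# In practice: pd.read_csv("ratings_Electronics.csv", names=cols, header=None)
# (file name is an assumption -- adjust to your local copy).
# A two-row stand-in keeps this snippet self-contained:
raw = io.StringIO("A3BY5KCNQZXV5U,B00004SB92,5.0,1341100800\n"
                  "A3T7V207KRDE2O,B00004SB92,4.0,1367193600\n")

# The raw CSV has no header row; assign names in the documented order.
cols = ["userId", "productId", "rating", "timestamp"]
df = pd.read_csv(raw, names=cols, header=None)
print(df.head())
```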

Step 3: Exploratory Data Analysis & Feature Engineering

Checking the datatypes and null records

Observations:

Let us convert the timestamp column to datetime format.

Extracting the year info from the timestamp
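
The conversion and year extraction might look like this (a toy frame stands in for the Amazon data; timestamps are Unix epoch seconds):

```python
import pandas as pd

# Toy sample standing in for the Amazon ratings DataFrame.
df = pd.DataFrame({
    "userId":    ["u1", "u2", "u3"],
    "productId": ["p1", "p2", "p1"],
    "rating":    [5.0, 4.0, 3.0],
    "timestamp": [1262304000, 1293840000, 1293926400],  # epoch seconds
})

# Convert the integer epoch to datetime, then pull out the year.
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")
df["year"] = df["timestamp"].dt.year

# Ratings per year, ready for a trend line chart.
print(df.groupby("year")["rating"].count())
```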

Ratings over the year - Trend Chart

Observations:

Number of Ratings per user

Observations:

Extracting a subset of the dataset with users having 50 or more ratings
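
One hedged way to apply the threshold (toy frame below; the threshold constant is lowered so the snippet stays self-contained):

```python
import pandas as pd

# Toy frame; on the real data this trims the long tail of one-off raters.
df = pd.DataFrame({
    "userId":    ["a", "a", "a", "b"],
    "productId": ["p1", "p2", "p3", "p1"],
    "rating":    [5, 4, 3, 2],
})

MIN_RATINGS = 3  # use 50 for the actual Amazon dataset
counts = df["userId"].value_counts()
active_users = counts[counts >= MIN_RATINGS].index
df_dense = df[df["userId"].isin(active_users)]
print(df_dense)
```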

Rechecking the distribution of Ratings after the subset operation (checking sparsity)

Observations:

Exploring ratings variable

Distribution of each rating value

Observations:

Exploring userId variable

Users with the most ratings

Exploring productId variable

Summary of Unique Users, Products and Total Ratings
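
The summary reduces to three counts; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "userId":    ["a", "a", "b"],
    "productId": ["p1", "p2", "p1"],
    "rating":    [5, 4, 3],
})

summary = {
    "unique_users":    df["userId"].nunique(),
    "unique_products": df["productId"].nunique(),
    "total_ratings":   len(df),
}
print(summary)
```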

Step 4: Split the dataset into Train & Test set for Model Building
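
A 70/30 random split as suggested in the objectives; `random_state` pins the shuffle so results are reproducible (the toy frame stands in for the real data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "userId":    list("aabbccddee"),
    "productId": list("pqpqpqpqpq"),
    "rating":    [5, 4, 3, 2, 1, 5, 4, 3, 2, 1],
})

# 70/30 random split with a fixed seed for reproducibility.
train, test = train_test_split(df, test_size=0.3, random_state=42)
print(len(train), len(test))  # 7 3
```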

Step 5: Build Popularity Based Recommender Model

Instantiating the PopularityRecommender Class
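
The notebook's `PopularityRecommender` implementation is not reproduced here; a minimal sketch of such a class (the class name matches the heading, the internals are assumptions) ranks products by rating count and excludes what the user has already rated:

```python
import pandas as pd

class PopularityRecommender:
    """Recommends the same globally popular products to every user."""

    def fit(self, train_df):
        # Rank products by number of ratings, tie-broken by mean rating.
        self.ranking_ = (train_df.groupby("productId")["rating"]
                         .agg(n_ratings="count", mean_rating="mean")
                         .sort_values(["n_ratings", "mean_rating"],
                                      ascending=False))
        return self

    def recommend(self, user_id, train_df, k=5):
        # Exclude products the user already rated, then take the top k.
        seen = set(train_df.loc[train_df["userId"] == user_id, "productId"])
        candidates = [p for p in self.ranking_.index if p not in seen]
        return candidates[:k]

# Toy usage:
ratings = pd.DataFrame({
    "userId":    ["a", "a", "b", "b", "c"],
    "productId": ["p1", "p2", "p1", "p3", "p1"],
    "rating":    [5, 4, 4, 5, 3],
})
model = PopularityRecommender().fit(ratings)
print(model.recommend("c", ratings, k=2))  # p1 excluded: c already rated it
```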

Creating the Recommendations for Train set

Generating Predictions for the Test set

Fetching Recommendations for the Random user 'A3BY5KCNQZXV5U'

Fetching Recommendations for another Random user 'A3T7V207KRDE2O'

--> We can see that the same set of products is recommended to every user, since the Popularity Based Recommender System does not use any user-specific information when making recommendations.

Evaluating Popularity Recommender Model
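
Evaluation here uses RMSE, the square root of the mean squared error between held-out ratings and the model's predictions; with illustrative numbers:

```python
import numpy as np

# RMSE between held-out test ratings and model predictions (made-up values).
actual    = np.array([5.0, 3.0, 4.0, 2.0])
predicted = np.array([4.5, 3.5, 4.0, 2.5])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 3))
```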

Step 6: Build Collaborative Filtering Based Recommender Model

Matrix Factorization using Singular Value Decomposition (SVD) Recommender Model

Loading from Dataframe into Surprise Dataset
Visualising the top 5 records from the Surprise dataset
Train Test split
Singular Value Decomposition from Surprise package
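
Surprise's `SVD` learns latent factors by stochastic gradient descent. As a library-free illustration of the same matrix-factorization idea (not the Surprise API; the matrix values are made up), one can take a truncated SVD of the mean-centered rating matrix:

```python
import numpy as np

# Tiny user-item matrix; 0 marks a missing rating.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 5.0, 4.0],
])
mask = R > 0

# Mean-center each user's observed ratings; missing entries stay 0.
user_means = R.sum(axis=1) / mask.sum(axis=1)
R_centered = np.where(mask, R - user_means[:, None], 0.0)

# Truncated SVD with k latent factors, then add the user means back.
k = 2
U, s, Vt = np.linalg.svd(R_centered, full_matrices=False)
R_hat = (U[:, :k] * s[:k]) @ Vt[:k] + user_means[:, None]

print(np.round(R_hat, 2))  # dense matrix of predicted ratings
```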

Hyperparameter Tuning using GridSearchCV (Surprise package) for SVD Model

Cross-Validation for SVD Model

Fitting the SVD Model with Best Hyperparameters

Evaluating SVD Recommender Model

Top N Recommendations from SVD Recommender Model
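
Once a model can score products a user has not yet rated, the top-N step is a descending sort over those scores (the scores below are made up):

```python
# Predicted scores for products the user has NOT yet rated (illustrative).
predicted = {"p1": 4.7, "p2": 3.1, "p3": 4.9, "p4": 2.0, "p5": 4.4, "p6": 4.8}

N = 5
top_n = sorted(predicted, key=predicted.get, reverse=True)[:N]
print(top_n)
```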

Fetching Recommendations for the Random user 'A3T7V207KRDE2O'

This is the same user for whom we fetched recommendations from the Popularity Based Recommender Model.

Comparing the SVD Recommendations with those of the Popularity Based Model for the same user

Observations:

User-User Nearest Neighbor Collaborative Filtering
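
Surprise's `KNNWithMeans` predicts a rating as the target user's mean plus a similarity-weighted average of the neighbors' mean-centered ratings. A compact numpy sketch of that idea (cosine similarity over full mean-centered rows for brevity, whereas Surprise restricts similarities to co-rated items):

```python
import numpy as np

# Toy user-item matrix; 0 = missing. Rows are users, columns are items.
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 3.0, 4.0],
    [1.0, 2.0, 1.0],
])
mask = R > 0
means = R.sum(axis=1) / mask.sum(axis=1)
C = np.where(mask, R - means[:, None], 0.0)  # mean-centered ratings

def predict(u, i):
    # Weighted average of mean-centered ratings from users who rated item i.
    rated = mask[:, i] & (np.arange(len(R)) != u)
    num, den = 0.0, 0.0
    for v in np.where(rated)[0]:
        sim = C[u] @ C[v] / (np.linalg.norm(C[u]) * np.linalg.norm(C[v]) + 1e-12)
        num += sim * (R[v, i] - means[v])
        den += abs(sim)
    return means[u] + (num / den if den else 0.0)

print(round(predict(0, 2), 2))
```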

Hyperparameter Tuning using GridSearchCV (Surprise package) for KNN (user) Model

Cross-Validation for KNN (user) Model

Fitting the KNN (user) Model with Best Hyperparameters

Evaluating User-based KNN Recommender Model

Top N Recommendations from User-based KNN Recommender Model

Fetching Recommendations for the Random user 'A3T7V207KRDE2O'

Comparing the KNN (user) Recommendations with those of the SVD Model for the same user
Comparing the KNN (user) Recommendations with those of the Popularity Based Model for the same user

Observations:

Item-Item Nearest Neighbor Collaborative Filtering

Taking a further subset of the data, since item-item collaborative filtering in the Surprise package runs into memory-allocation issues on large datasets: the item-item similarity matrix grows with the square of the number of items (a known scalability limitation of neighborhood-based CF).

Item-item based KNN With Means Model

Note: GridSearch and cross-validation were not performed for the item-based model due to memory constraints.
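
Item-item CF runs the same neighborhood machinery on the transposed matrix, so similarities are computed between item columns instead of user rows. A minimal illustration of the transposition step (toy matrix):

```python
import numpy as np

# Toy user-item matrix; 0 = missing.
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 3.0, 4.0],
    [1.0, 2.0, 1.0],
])

# Transpose so rows are items; the user-based code can then be reused,
# with per-item means replacing per-user means.
R_items = R.T
mask = R_items > 0
item_means = R_items.sum(axis=1) / mask.sum(axis=1)
print(np.round(item_means, 2))
```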

Evaluating Item-based KNN Recommender Model

Top N Recommendations from Item-based KNN Recommender Model

Step 7: Comparison of Recommender Model Performance

Step 8: Learnings and Summary

Insights:

**End of Assignment**